24 research outputs found

    Fisher Kernels and Probabilistic Latent Semantic Models

    Tasks that rely on the semantic content of documents, notably Information Retrieval and Document Classification, can benefit from a good account of document context, i.e. the semantic association between documents. To this effect, the scheme of latent semantics blends individual words appearing throughout a document collection into latent topics, thus providing a way to handle documents that is less constrained than the conventional approach by the mere appearance of particular words. Probabilistic latent semantic models take the matter further by providing assumptions on how the documents observed in the collection would have been generated. This allows the derivation of inference algorithms that fit the model parameters to the observed document collection; once their values are set, these parameters can be used to compute the similarities between documents. Fisher kernels, similarity functions rooted in information geometry, are good candidates for measuring the similarity between documents in the framework of probabilistic latent semantic models. In this context, we study the use of Fisher kernels for the Probabilistic Latent Semantic Indexing (PLSI) model. By thoroughly analysing the generative process of PLSI, we derive the proper Fisher kernel for PLSI and expose the hypotheses that relate former work to this kernel. In particular, we confirm that the Fisher information matrix (FIM) should not be approximated by the identity in the case of PLSI. We also study how the contribution of the latent topics and that of the distribution of words among the topics affect the performance of the Fisher kernel; finally, we provide empirical evidence and theoretical arguments showing that the Fisher kernel originally published by Hofmann, corrected to account for the FIM, is the best of the PLSI Fisher kernels. It can compete with the strong BM25 baseline, and even significantly outperforms it when documents sharing few words must be matched. 
We further study PLSI document similarities by applying the language model approach. This approach shuns the usual IR paradigm that considers documents and queries to be of a similar nature. Instead, it considers documents as representative of language models, and uses probabilistic tools to determine which of these models would have generated the query with the highest probability. Using this scheme in the framework of PLSI provides a way to bypass the issue of query representation, which constitutes one of the specific challenges of PLSI. We find that the language model approach performs as well as the best of the Fisher kernels when enough latent categories are provided. Finally, we propose a new probabilistic latent semantic model consisting of a mixture of Smoothed Dirichlet distributions which, by better modeling word burstiness, provides a more realistic model of empirical observations on real document collections than the commonly used multinomials.
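
As an illustration of the language model idea summarised in this abstract, the sketch below ranks documents by the log-probability that their PLSI word distribution generates a query, using the standard PLSI mixture P(w|d) = sum over z of P(w|z) P(z|d). It is a minimal sketch, not the authors' implementation: the function names, the data layout, and the toy parameter values are all assumptions made for illustration, and no smoothing is applied.

```python
import math

def query_log_likelihood(query_counts, p_w_given_z, p_z_given_d):
    """Return log P(query | document) under a PLSI-style mixture.

    query_counts: dict word -> count in the query (hypothetical layout)
    p_w_given_z:  list of dicts, one per latent topic, word -> P(w | z)
    p_z_given_d:  list of topic weights P(z | d) for one document
    """
    total = 0.0
    for word, count in query_counts.items():
        # P(w | d) = sum_z P(w | z) * P(z | d)
        p_w_d = sum(p_w_given_z[z].get(word, 0.0) * p_z
                    for z, p_z in enumerate(p_z_given_d))
        if p_w_d == 0.0:
            # Without smoothing, an unseen word makes the query impossible.
            return float("-inf")
        total += count * math.log(p_w_d)
    return total

def rank_documents(query_counts, p_w_given_z, p_z_given_docs):
    """Sort document ids by decreasing query log-likelihood."""
    scores = {doc_id: query_log_likelihood(query_counts, p_w_given_z, mix)
              for doc_id, mix in p_z_given_docs.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy parameters, invented for illustration only.
p_w_given_z = [
    {"kernel": 0.5, "fisher": 0.4, "mercury": 0.1},
    {"kernel": 0.1, "fisher": 0.1, "mercury": 0.8},
]
p_z_given_docs = {"d1": [0.9, 0.1], "d2": [0.2, 0.8]}
query = {"fisher": 1, "kernel": 1}
ranking = rank_documents(query, p_w_given_z, p_z_given_docs)
print(ranking)
```

Because the query is only scored against each document's fitted distribution, it never needs its own topic representation, which is the point the abstract makes about bypassing query representation.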

    Dictionary-Ontology Cross-Enrichment Using TLFi and WOLF to enrich one another

    It has been known since Ide and Veronis that it is impossible to automatically extract an ontology structure from a dictionary, because that information is simply not present. We attempt to extract structure elements from a dictionary using clues taken from a formal ontology, and use these elements to match dictionary definitions to ontology synsets; this allows us to enrich the ontology with dictionary definitions, assign ontological structure to the dictionary, and disambiguate elements of definitions and synsets.

    Inclusion de sens dans la représentation de documents textuels : état de l'art

    This document gives an overview of the state of the art in representing meaning in textual documents.

    Rôle de la matrice d'information et pondération des composantes dans les noyaux de Fisher pour PLSI

    ABSTRACT. An information-geometric approach for document similarities in the framework of “Probabilistic Latent Semantic Indexing” was first proposed by T. Hofmann (2000) and later extended (“revisited”) by Nyffenegger et al. (2006). This paper presents an in-depth study and revision of these models by (1) providing a simpler unified description framework, (2) investigating the role of the Fisher Information Matrix G(θ), and (3) analyzing the impact of latent “topic” parameters in such models. It furthermore provides new experimental results on larger collections coming from the TREC–AP evaluation corpus
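
For reference, the general textbook form of a Fisher kernel between two documents can be written as below. This is the standard definition rather than the specific PLSI derivation of the paper; the symbols d_i, d_j, and U are notation assumed here, while G(θ) is the Fisher Information Matrix named in the abstract.

```latex
K(d_i, d_j) = U_{d_i}(\theta)^{\top} \, G(\theta)^{-1} \, U_{d_j}(\theta),
\qquad
U_{d}(\theta) = \nabla_{\theta} \log P(d \mid \theta),
\qquad
G(\theta) = \mathbb{E}_{d}\!\left[ U_{d}(\theta) \, U_{d}(\theta)^{\top} \right]
```

Replacing G(θ) with the identity reduces the kernel to the plain inner product of the score vectors U, which is the simplification whose validity point (2) of the abstract examines.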

    Utilisation de PLSI en recherche d'information

    The PLSI model (“Probabilistic Latent Semantic Indexing”) offers a document indexing scheme based on probabilistic latent category models. It has led to applications in diverse fields, notably in information retrieval (IR). Nevertheless, PLSI cannot process documents not seen during parameter inference, a major liability for queries in IR. A method known as “folding-in” circumvents this problem up to a point, but has its own weaknesses. The present paper introduces a new document-query similarity measure for PLSI, based on language models, that entirely avoids the problem of query projection. We compare this similarity to Fisher kernels, the state-of-the-art similarity measures for PLSI. Moreover, we present an evaluation of PLSI on a particularly large training set of almost 7500 documents and over one million term occurrences, created from the TREC–AP collection.

    Topian 0.1 Reference Manual

    This document describes Topian ("Topic-based Model layer for Xapian"), a software layer intended to add support for topical models to Xapian

    Free Software for research in Information Retrieval and Textual Clustering

    The document provides an overview of the main Free ("Open Source") software of interest for research in Information Retrieval, as well as some background on the context. It provides a guideline for choosing appropriate tools.

    Mercury dynamics in a San Francisco estuary tidal wetland : assessing dynamics using in situ measurements

    © The Author(s), 2012. This article is distributed under the terms of the Creative Commons Attribution License. The definitive version was published in Estuaries and Coasts 35 (2012): 1036-1048, doi:10.1007/s12237-012-9501-3. We used high-resolution in situ measurements of turbidity and fluorescent dissolved organic matter (FDOM) to quantitatively estimate the tidally driven exchange of mercury (Hg) between the waters of the San Francisco estuary and Browns Island, a tidal wetland. Turbidity and FDOM (representative of particle-associated and filter-passing Hg, respectively) together predicted 94% of the observed variability in measured total mercury concentration in unfiltered water samples (UTHg) collected during a single tidal cycle in spring, fall, and winter, 2005–2006. Continuous in situ turbidity and FDOM data spanning at least a full spring-neap period were used to generate UTHg concentration time series using this relationship, and then combined with water discharge measurements to calculate Hg fluxes in each season. Wetlands are generally considered to be sinks for sediment and associated mercury. However, during the three periods of monitoring, Browns Island wetland did not appreciably accumulate Hg. Instead, gradual tidally driven export of UTHg from the wetland offset the large episodic on-island fluxes associated with high wind events. Exports were highest during large spring tides, when ebbing waters relatively enriched in FDOM, dissolved organic carbon (DOC), and filter-passing mercury drained from the marsh into the open waters of the estuary. On-island flux of UTHg, which was largely particle-associated, was highest during strong winds coincident with flood tides. 
Our results demonstrate that processes driving UTHg fluxes in tidal wetlands encompass both the dissolved and particulate phases and multiple timescales, necessitating longer term monitoring to adequately quantify fluxes. This work was supported by funding from the California Bay Delta Authority Ecosystem Restoration and Drinking Water Programs (grant ERP-00-G01) and matching funds from the United States Geological Survey Cooperative Research Program.

    The Importance of the Stem Cell Marker Prominin-1/CD133 in the Uptake of Transferrin and in Iron Metabolism in Human Colon Cancer Caco-2 Cells

    As the pentaspan stem cell marker CD133 was shown to bind cholesterol and to localize in plasma membrane protrusions, we investigated a possible function for CD133 in endocytosis. Using the CD133 siRNA knockdown strategy and non-differentiated human colon cancer Caco-2 cells that constitutively over-expressed CD133, we provide for the first time direct evidence for a role of CD133 in the intracellular accumulation of fluorescently labeled extracellular compounds. Assessed using AC133 monoclonal antibody, CD133 knockdown was shown to improve Alexa488-transferrin (Tf) uptake in Caco-2 cells but had no impact on FITC-dextran or FITC-cholera-toxin. The absence of an effect of the CD133 knockdown on Tf recycling established a role for CD133 in inhibiting Tf endocytosis rather than in stimulating Tf exocytosis. The use of previously identified inhibitors of known endocytic pathways, together with the positive impact of CD133 knockdown on cellular uptake of clathrin-endocytosed synthetic lipid nanocapsules, supported the conclusion that the impact of CD133 on endocytosis was primarily ascribed to the clathrin pathway. Also, cholesterol extraction with methyl-β-cyclodextrin upregulated Tf uptake at greater intensity in the CD133high situation than in the CD133low situation, thus suggesting a role for cholesterol in the inhibitory effect of CD133 on endocytosis. Interestingly, cell treatment with the AC133 antibody downregulated Tf uptake, thus demonstrating that direct extracellular binding to CD133 could affect endocytosis. Moreover, flow cytometry and confocal microscopy established that downregulation of CD133 improved the accessibility to the TfR from the extracellular space, providing a mechanism by which CD133 inhibited Tf uptake. As Tf is involved in supplying iron to the cell, the effects of iron supplementation and deprivation on CD133/AC133 expression were investigated. Both produced a dose-dependent downregulation, discussed here in light of transcriptional and post-transcriptional effects. 
Taken together, these data extend our knowledge of the function of CD133 and underline the value of further exploring the CD133-Tf-iron network.